Intro

Below are:

  • A summary of key points that cropped up again and again.
  • A set of proposed readings by topic.

Note that these notes haven't yet been edited or extended using the audio recordings I made, and some need significantly more material. See the "Proposed readings by topic" section below or ask me for more details.

Summary

  • Both the financial and academic worlds are increasingly adopting Python for the same reasons.

    • They're often encumbered with extremely large, heterogeneous, legacy systems.
      • Old work isn't discarded. Incremental additions, re-use over interfaces.
      • Think COBOL/Excel/VBA for finance, FORTRAN/C in academia.
    • They're both seeking one paradigm as an end-to-end high-level solution for all users.
    • Neither can sacrifice performance, yet both are finding time-to-market and development lifecycles too long with their legacy systems.
    • Python has long had a reputation for being unable to deliver performance, but it is now commonly acknowledged that this is no longer true.
      • Python easily serves as glue for high-performance libraries such as NumPy and SciPy, file formats such as HDF5, and heterogeneous computation backends such as shared-memory parallelism (SMP, via OpenMP in Cython; a minimal sketch follows this list), GPUs (CUDA) and FPGAs (OpenCL).
      • C/C++-level performance can be achieved with significantly simpler code and designs, but it still requires sophisticated knowledge of memory cache hierarchies, disk I/O patterns, and SMP issues.
    • The financial industry has long relied on Python as an interface to core, high-performance components, but it is paranoid and extremely closed: in practice, the only way knowledge gets shared is when employees move between firms.
    • More information:
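
    To make the "OpenMP via Cython" point above concrete, here is a minimal sketch; the module and function names are my own invention, and it assumes a C compiler with OpenMP support.

      # parallel_sum.pyx -- hypothetical module illustrating SMP via
      # Cython's OpenMP support. Build with OpenMP enabled, e.g.
      #   CFLAGS=-fopenmp LDFLAGS=-fopenmp cythonize -i parallel_sum.pyx
      from cython.parallel import prange

      def parallel_sum(double[:] xs):
          """Sum a NumPy array across all cores with the GIL released."""
          cdef double total = 0.0
          cdef Py_ssize_t i
          # prange spreads iterations over OpenMP threads; Cython infers
          # that "total += ..." is a reduction.
          for i in prange(xs.shape[0], nogil=True):
              total += xs[i]
          return total
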
  • IPython Notebook is universally used and loved.

  • No clear future for visualisations in Python.

    • matplotlib is universally used for publication-quality charts. API is difficult to use but powerful and well engineered.
    • It's clear that web-based visualisations are the future, and very important even for publications.
    • It's also acknowledged that JavaScript, rather than static images, is the ideal way to reach the browser.
    • But how to reach the browser? There were many different perspectives:
      • IPython Notebook - use the %matplotlib inline magic incantation to draw charts; no interactivity, no JavaScript (a minimal example follows this list).
        • Other libraries built on top of matplotlib of course work just as well: ggplot, seaborn, prettyplotlib.
      • "04 - Visualisations in Bokeh": people love ggplot in R because of the Grammar of Graphics, and people love Python because it's a one-stop stop
        • So use Python with a ggplot-like grammar to auto-generate HTML5-canvas backed web visualisations using JavaScript.
        • HTML5-canvas is an investment and should reap rewards over SVG-based libraries like d3.js for very complex visualisations.
      • "12 - Getting it out there - Python-JS-web-viz": forget Python, just code front-end in JavaScript and defer back-end and data cleaning to Python.
        • d3.js, nvd3, crossfilter, rickshaw, ...
      • Lightning talks
        • One presenter used Python over a WebSocket bridge to Angular.js to create an R Shiny-style interactive chart environment.
        • Another presenter showed off IPython version 2 functionality (coming end of April 2014): interactive widgets that dynamically recreate charts based on user input.
    • There is no clear conclusion, except that matplotlib is fantastic work and has stood the test of time.
    • Bokeh seems very exciting but rough around the edges, with a large and difficult-to-install set of dependencies; the tutorials are worth exploring in full (which I plan to do, writing the results up in a new article).
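
    For reference, the matplotlib route above is this simple in a notebook cell, with the chart arriving as a static inline image:

      %matplotlib inline

      import numpy as np
      import matplotlib.pyplot as plt

      # A static, publication-quality chart: no JavaScript, no interactivity.
      x = np.linspace(0, 2 * np.pi, 200)
      plt.plot(x, np.sin(x), label="sin(x)")
      plt.legend()
      plt.title("Rendered inline as a static image")
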
  • Cython is almost universally used, but more agile methods are being sought.
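
    As an example of how low the barrier now is, the Cython cell magic compiles on the fly inside the notebook with no separate build step (load it first with %load_ext cythonmagic, or %load_ext Cython on newer Cython releases):

      %%cython
      # Typed memoryview plus typed loop variables: the usual cheap win
      # over a pure-Python loop.
      def csum(double[:] xs):
          cdef double total = 0.0
          cdef Py_ssize_t i
          for i in range(xs.shape[0]):
              total += xs[i]
          return total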

  • Everyone uses scikit-learn.
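
    Part of the appeal is the uniform estimator interface: everything exposes fit/predict, so swapping models barely changes the surrounding code. A minimal example:

      from sklearn.datasets import load_iris
      from sklearn.linear_model import LogisticRegression

      iris = load_iris()
      # Only this line changes when trying a different model.
      model = LogisticRegression().fit(iris.data, iris.target)
      print(model.predict(iris.data[:5]))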

  • MapReduce/clusters have less hype and traction than you'd expect.

    • Certainly there are some who use it for their data processing pipeline, e.g. "07 - Hierarchical Text Clustering in Python and Hive".
    • Given a large data set that cannot fit onto one disk, prefer to create large RDBMS clusters. See:
    • Given a large data set that cannot fit into memory, prefer to use e.g. HDF5 to disk-back it, or create additional abstractions on top of NumPy/HDF5, à la "23 - Manipulating massive disk-backed arrays" (a minimal h5py sketch follows this list).
    • scikit-learn core contributors strongly prefer shared-memory parallelism to clusters, and are actively creating OpenMP-style abstractions (with better debugging and NumPy array performance).
    • Fantastic lightning talk in which the presenter used FireDrake to easily switch the computing backend from SMP to GPU, but again no mention of clusters.
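
    As a sketch of the disk-backed approach above, here is the HDF5 route via h5py; the file and dataset names are made up, and the sizes are kept small enough to actually run. The point is that a dataset handle behaves like a NumPy array, but only the slices you touch are read into memory:

      import numpy as np
      import h5py

      n, chunk = 10**7, 10**6

      # Write the array to disk chunk by chunk, never holding it all in RAM.
      with h5py.File("big.h5", "w") as f:
          d = f.create_dataset("xs", shape=(n,), dtype="f8", chunks=True)
          for i in range(0, n, chunk):
              d[i:i + chunk] = np.random.rand(chunk)

      # Read it back the same way: slicing the dataset handle loads only
      # that slice into memory.
      with h5py.File("big.h5", "r") as f:
          xs = f["xs"]
          total = sum(float(np.sum(xs[i:i + chunk]))
                      for i in range(0, n, chunk))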

Proposed readings by topic

Culture, industry background

Case studies

Technical - software engineering

Technical - mathematical

Technical - other

Visualisations

